Towards Automatic Grammar Acquisition from a Bracketed Corpus
Abstract
1 Introduction
Designing and refining a natural language grammar is a difficult and time-consuming task that requires a large amount of skilled effort. A hand-crafted grammar is usually not completely satisfactory and frequently fails to cover many unseen sentences. Automatic acquisition of grammars is a solution to this problem. Recently, with the increasing availability of large, machine-readable, parsed corpora, there have been numerous attempts to acquire a context-free grammar (CFG) automatically from large existing corpora[Lar90][Mil94][Per92][Shi95].
Lari and Young[Lar90] proposed the so-called inside-outside algorithm, which constructs a grammar from an unbracketed corpus based on probability theory. The grammar acquired by this method is assumed to be in Chomsky normal form, and a large amount of computation is required. Later, Pereira[Per92] applied this algorithm to a partially bracketed corpus to reduce the computation time. Kiyono[Kiy94b][Kiy94a] combined symbolic and statistical approaches to extract useful grammar rules from a partially bracketed corpus. To avoid generating a large number of grammar rules, some basic grammatical constraints, local boundary constraints and X-bar theory were applied. Kiyono's approach refines an original grammar by adding rules to it, while the inside-outside algorithm tries to construct a whole grammar from a corpus based on maximum likelihood. However, it is costly to obtain a suitable grammar from an unbracketed corpus, and the results of these approaches are hard to evaluate.
As more bracketed corpora have been constructed, Shirai[Shi95] made an attempt to use a bracketed (tagged) corpus for grammar inference. Shirai constructed a Japanese grammar based on some simple rules that give a name (a label) to each bracket in the corpus. To reduce the grammar size and ambiguity, some hand-encoded knowledge is applied in this approach.
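To make the idea of naming brackets concrete, the following minimal sketch shows how CFG rules can be read off a bracketed corpus once every bracket has a label. It is an illustration only, not Shirai's rules or any method from the cited work: the toy corpus, the nested-list tree format, and the names `CORPUS`, `TOY_LABELS`, `toy_labeler` and `extract_rules` are all assumptions introduced here, and the hand-written lookup table merely stands in for whatever procedure assigns the labels.

```python
from collections import Counter

# Toy bracketed corpus (hypothetical data, not from the paper).
# A string is a lexical tag; a list is one unlabeled bracket around its children.
CORPUS = [
    [["DT", "NN"], ["VB", ["DT", "NN"]]],
]

# Hand-written stand-in for a labeling procedure (an assumption for illustration).
TOY_LABELS = {
    ("DT", "NN"): "NP",
    ("VB", "NP"): "VP",
    ("NP", "VP"): "S",
}


def toy_labeler(children):
    """Name a bracket from the symbols of its children; 'X' if unknown."""
    return TOY_LABELS.get(children, "X")


def extract_rules(node, label_bracket, rules):
    """Return the symbol for `node`, counting one CFG rule per bracket in `rules`."""
    if isinstance(node, str):
        # A lexical tag is already a grammar symbol.
        return node
    # Label the children first (bottom-up), then name this bracket.
    children = tuple(extract_rules(child, label_bracket, rules) for child in node)
    lhs = label_bracket(children)
    rules[(lhs, children)] += 1
    return lhs


rules = Counter()
for tree in CORPUS:
    extract_rules(tree, toy_labeler, rules)

for (lhs, rhs), count in sorted(rules.items()):
    print(f"{lhs} -> {' '.join(rhs)}  (count {count})")
```

Passing the labeler as a parameter keeps the rule extraction independent of how labels are chosen, so a hand-coded table could be replaced by any other labeling procedure without touching `extract_rules`.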
In our work, as in Shirai's approach, we make use of a bracketed corpus with lexical tags, but instead of using a set of hand-encoded, predefined rules to give a name (a label) to each bracket, we introduce statistical techniques to acquire such labels automatically. With a bracketed corpus, the grammar learning task reduces to the problem of determining the nonterminal label of each bracket in the corpus. More precisely, the task is to classify brackets into groups and give each group a label. We propose a method to group brackets in
Similar resources
Automatic Extraction of Japanese Grammar from a Bracketed Corpus
In recent years, numerous attempts have been devoted to derive Context Free Grammars (CFGs) by using a large corpus. In this paper, we describe a method to extract a Probabilistic Context Free Grammar (PCFG) of Japanese from a bracketed corpus, and propose two methods to improve it. The experiments show that the extracted PCFG has a 94 % accept rate, 85 % brackets recall and 75 % brackets preci...
Automatic Detection of Syllable Boundaries Combining the Advantages of Treebank and Bracketed Corpora Training
An approach to automatic detection of syllable boundaries is presented. We demonstrate the use of several manually constructed grammars trained with a novel algorithm combining the advantages of treebank and bracketed corpora training. We investigate the effect of the training corpus size on the performance of our system. The evaluation shows that a hand-written grammar performs better on findi...
Grammar Acquisition and Statistical Parsing by exploiting Local Contextual Information
This paper presents a method for inducing a context-sensitive conditional probability context-free grammar from an unlabeled bracketed corpus using local contextual information and describes a natural language parsing model which uses a probability-based scoring function of the grammar to rank parses of a sentence. This method uses clustering techniques to group brackets in a corpus into a numbe...
Grammar Acquisition Based on Clustering Analysis and Its Application to Statistical Parsing
This paper proposes a new method for learning a context-sensitive conditional probability context-free grammar from an unlabeled bracketed corpus based on clustering analysis and describes a natural language parsing model which uses a probability-based scoring function of the grammar to rank parses of a sentence. By grouping brackets in a corpus into a number of similar bracket groups based on ...
Publication date: 1996